1. Business Task

The business task is to analyze the differences in usage patterns between annual members and casual riders of Cyclistic’s bike-share program. This analysis will guide the design of marketing strategies aimed at converting casual riders into annual members, thereby maximizing the profitability of the Cyclistic program.

2. Understand the Dataset

The analysis was conducted using Cyclistic’s historical bike trip data. The dataset includes the following attributes:

  • ride_id: Unique identifier for each ride.
  • rideable_type: Type of bike used.
  • started_at and ended_at: Start and end times of each trip.
  • start_station_name and end_station_name: Start and end stations of the trip.
  • start_lat and start_lng, end_lat and end_lng: Latitude and longitude of start and end points.
  • member_casual: Indicates whether the rider is a “member” or “casual”.

3. Data Preparation

Load and Combine Monthly Data from January 2024 to December 2024.

# List of monthly data file paths
file_paths <- list(
  "divvy-tripdata/202401-divvy-tripdata.csv",
  "divvy-tripdata/202402-divvy-tripdata.csv",
  "divvy-tripdata/202403-divvy-tripdata.csv",
  "divvy-tripdata/202404-divvy-tripdata.csv",
  "divvy-tripdata/202405-divvy-tripdata.csv",
  "divvy-tripdata/202406-divvy-tripdata.csv",
  "divvy-tripdata/202407-divvy-tripdata.csv",
  "divvy-tripdata/202408-divvy-tripdata.csv",
  "divvy-tripdata/202409-divvy-tripdata.csv",
  "divvy-tripdata/202410-divvy-tripdata.csv",
  "divvy-tripdata/202411-divvy-tripdata.csv",
  "divvy-tripdata/202412-divvy-tripdata.csv"
)

# Load and combine data using data.table
monthly_data <- rbindlist(lapply(file_paths, fread))

3.1 Data Cleaning

The number of rows before cleaning the data: 5860568. To ensure accuracy and relevance, the following cleaning and transformation steps were performed.

Check for Missing Values (N/A)

# Check for missing values
missing_summary <- colSums(is.na(monthly_data))
missing_summary
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                  0                  0 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               7232               7232 
##      member_casual 
##                  0

The data shows that end latitude and longitude of rows 7232 is missing. We can get the missing geo data from rows with complete data of these stations.

Check for blank values in key variables

  1. The number of rows missing data in start latitude column: 0.
  2. The number of rows missing in start longitude column: 0.
  3. The number of rows missing in start station name column: 1073951.
  4. The number of rows missing in start station id field: 1073951.
  5. The number of rows missing in end latitude column: 7232.
  6. The number of rows missing in end longitude column: 7232`.
  7. The number of rows missing in station name column: 1104653.
  8. The number of rows missing in end station id column: 1104653.
  9. The number of rows missing start time: 0.
  10. The number of rows missing end time: 0.
  11. The number of rows missing member type: 0.
  12. The number of rows missing bike type: 0.

Negative longitudes

The longitude of the Chicago area is negative because it is located in the Western Hemisphere. We shall check if there are positive values in longitudes column and change to negative. After checking the data, we found:
There were No positive longitudes.

# Check if there are positive longitudes
has_positive_longitudes <- any(monthly_data$start_lng > 0) & any(monthly_data$end_lng > 0)

# Print the result
if (has_positive_longitudes) {
  monthly_data[, start_lng := -abs(start_lng)]
  monthly_data[, end_lng := -abs(end_lng)]
}

Get stations table

Missing data are station names and station geo data. To fill the data, we shall use rows that don’t have missing station names and station geo data.

  1. Copy records that don’t contain missing or blank values to non_missing table
# Ensure monthly_data is a data.table
setDT(monthly_data)
# Remove duplicates
monthly_data <- unique(monthly_data, by = "ride_id")
# Filter rows where the specified columns are not missing or blank
non_missing <- monthly_data[
  !is.na(start_station_name) & start_station_name != "" &
  !is.na(start_station_id) & start_station_id != "" &
  !is.na(end_station_name) & end_station_name != "" &
  !is.na(end_station_id) & end_station_id != "" &
  !is.na(end_lat) & end_lat != "" &
  !is.na(end_lng) & end_lng != ""
]

The number of rows that have non-missing data: 4208188.

  1. Get each station with its geo data.
# Group by `start_station_id` and `start_station_name`, and calculate the average latitude and longitude
start_station_coords <- non_missing[, .(
  station_lat = mean(start_lat),  # Calculate average latitude
  station_lng = mean(start_lng)   # Calculate average longitude
), by = .(start_station_id, start_station_name)]

# Group by `end_station_name` and calculate the average latitude and longitude
end_station_coords <- non_missing[, .(
  station_lat = mean(end_lat),  # Calculate average latitude
  station_lng = mean(end_lng)   # Calculate average longitude
), by = .(end_station_id, end_station_name)]

# Rename columns for merging
setnames(start_station_coords, "start_station_name", "station_name")
setnames(end_station_coords, "end_station_name", "station_name")

setnames(start_station_coords, "start_station_id", "station_id")
setnames(end_station_coords, "end_station_id", "station_id")

start_station_coords <- unique(start_station_coords)
end_station_coords <- unique(end_station_coords)
  1. Combine unique coordinates from both start and end stations.
# Combine unique coordinates from both start and end stations
all_station_coords <- rbind(start_station_coords, end_station_coords)
  1. To handle multiple latitude and longitude values for the same station, we compute the average latitude and longitude for each station type.
# Aggregate latitude and longitude by station_name
stations <- all_station_coords[, .(
  station_lat = mean(station_lat, na.rm = TRUE),    # Calculate average latitude
  station_lng = mean(station_lng, na.rm = TRUE)     # Calculate average longitude
), by = .(station_id,station_name)]

# Ensure `stations` has unique rows by `station_id`
stations <- unique(stations, by = "station_id")

The number of stations are: 1763.

  1. Find station that are only present in monthly data and not in stations table.
library(data.table)

# Ensure data.tables
setDT(monthly_data)
setDT(stations)


# Find `start_station_id`s in `monthly_data` not in `stations`
missing_start_station_ids <- unique(monthly_data[!start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "", start_station_id])

# Find `end_station_id`s in `monthly_data` not in `stations`
missing_end_station_ids <- unique(monthly_data[!end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "", end_station_id])

# Combine the results for both start and end stations
missing_station_ids <- unique(c(missing_start_station_ids, missing_end_station_ids))

Number of stations missing in station table are: 11.

  1. Add missing stations to station table.
library(data.table)

# Ensure data.tables
setDT(monthly_data)
setDT(stations)

# Rename the `stations` columns for clarity
# setnames(stations, c("id", "name", "lat", "lng"),
#          c("station_id", "station_name", "station_lat", "station_lng"))

# Identify missing start_station_id
missing_start_stations <- monthly_data[
  !start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "",
  .(station_id = start_station_id, 
    station_name = start_station_name, 
    station_lat = start_lat, 
    station_lng = start_lng)
]

# Identify missing end_station_id
missing_end_stations <- monthly_data[
  !end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "",
  .(station_id = end_station_id, 
    station_name = end_station_name, 
    station_lat = end_lat, 
    station_lng = end_lng)
]

# Combine the missing stations
missing_stations <- unique(rbind(missing_start_stations, missing_end_stations))

# Add the missing stations to the `stations` table
stations <- rbind(stations, missing_stations, use.names = TRUE, fill = TRUE)

stations <- unique(stations, by = "station_id")

The new number of stations are: 1774.

  1. Confirm if there still stations that are only present in monthly data and not in stations table.
library(data.table)

# Ensure data.tables
setDT(monthly_data)
setDT(stations)


# Find `start_station_id`s in `monthly_data` not in `stations`
missing_start_station_ids <- unique(monthly_data[!start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "", start_station_id])

# Find `end_station_id`s in `monthly_data` not in `stations`
missing_end_station_ids <- unique(monthly_data[!end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "", end_station_id])

# Combine the results for both start and end stations
missing_station_ids <- unique(c(missing_start_station_ids, missing_end_station_ids))

Number of stations still missing are: 0.

Matching stations

We perform a “rough match” to match latitude and longitude data between the monthly_data table and the stations table. Since the latitude and longitude data in monthly_data is varied and the stations table contains averaged coordinates, we use a tolerance range for matching. We shall use Tolerance range from 0.0001 to 0.01

Justification for using Tolerance (0.0001 to 0.01)

Using a tolerance range from 0.0001 to 0.01 ensures flexibility in matching latitude and longitude data between the data sets. This range progressively balances precision and completeness. A tolerance of 0.003 (~333 meters) effectively captures most valid matches in a densely populated urban area like Chicago without over-matching, accommodating minor variations in GPS data. Expanding the tolerance incrementally up to 0.01 (~1.1 km) accounts for wider spatial variations, ensuring accuracy for stations spaced farther apart while maintaining reliable data alignment.

First Matching Pass (0.0001)
# Define tolerance for latitude and longitude
tolerance <- 0.0001

# Rough match function with a tolerance of 0.0001
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.0001.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 81.8429833.
  • The Percentage of resolved end stations: 81.323834.
Second Matching Pass (0.0002)
# Define tolerance for latitude and longitude
tolerance <- 0.0002

# Rough match function with a tolerance of 0.0002
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.0002.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 81.8954374.
  • The Percentage of resolved end stations: 81.3768513.
Third Matching Pass (0.0003)
# Define tolerance for latitude and longitude
tolerance <- 0.0003

# Rough match function with a tolerance of 0.0003
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.0003.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 81.9882816.
  • The Percentage of resolved end stations: 81.4699343.
Fourth Matching Pass (0.0005)
# Define tolerance for latitude and longitude
tolerance <- 0.0005

# Rough match function with a tolerance of 0.0005
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.0005.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 82.1165161.
  • The Percentage of resolved end stations: 81.6022812.
Fifth Matching Pass (0.001)
# Define tolerance for latitude and longitude
tolerance <- 0.001

# Rough match function with a tolerance of 0.001
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.001.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 85.1388917.
  • The Percentage of resolved end stations: 84.6430687.
Sixth Matching Pass (0.002)
# Define tolerance for latitude and longitude
tolerance <- 0.002

# Rough match function with a tolerance of 0.002
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.002.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 92.8226045.
  • The Percentage of resolved end stations: 92.4825911.
Seventh Matching Pass (0.003)
# Define tolerance for latitude and longitude
tolerance <- 0.003

# Rough match function with a tolerance of 0.003
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.003.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 96.8251422.
  • The Percentage of resolved end stations: 96.5735705.
Eighth Matching Pass (0.005)
# Define tolerance for latitude and longitude
tolerance <- 0.005

# Rough match function with a tolerance of 0.005
match_lat_lng <- function(lat, lng, stations) {
  matched_station <- stations[
    abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
  ]
  
  if (nrow(matched_station) == 1) {
    return(matched_station[1]) # Return the matched row if there's exactly one match
  } else if (nrow(matched_station) > 1) {
    # If multiple matches, return the nearest one
    matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
    return(matched_station[which.min(dist)])
  } else {
    # No match found
    return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
  }
}

# Match logic with fallback to original values
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == ""),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Match logic with fallback to original values for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == ""),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.005.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 99.3869315.
  • The Percentage of resolved end stations: 99.2319069.

Examine remaining data that have missing stations

After quick observation, data with missing stations fall under 2 categories:

  1. Rows whose latitude and longitude are of 2 decimal places.
  2. Rows whose latitudes and longitudes are missing.

Solution: Since the precision is low, we increase the tolerance for these specific rows to 0.01 accommodate the lack of decimal detail.

Identify Rows with 2 Decimal Place latitude and longitude

two_decimal_start <- monthly_data[
  !is.na(start_lat) & !is.na(start_lng) & 
   (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) |start_station_name == "") &
  (round(start_lat, 2) == start_lat & round(start_lng, 2) == start_lng)
]
two_decimal_end <- monthly_data[
  !is.na(end_lat) & !is.na(end_lng) & 
   (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) |end_station_name == "") &
  (round(end_lat, 2) == end_lat & round(end_lng, 2) == end_lng)
]
  • For data with missing start stations,35928 rows out of 35928 rows have (lat and long) with 2 decimal places.
  • For data with missing end stations,37821 rows out of 45013 rows have (lat and long) with 2 decimal places.
Last Matching Pass (0.01)
# Define tolerance for rows with 2 decimals
tolerance <- 0.01

# Identify rows with lat/lng rounded to 2 decimals
monthly_data[
  (is.na(start_station_id) | start_station_id == "") & 
  (is.na(start_station_name) | start_station_name == "") & 
  (round(start_lat, 2) == start_lat & round(start_lng, 2) == start_lng),
  c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
    match <- match_lat_lng(start_lat, start_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
    )
  },
  by = .(start_lat, start_lng)
]

# Repeat the process for end stations
monthly_data[
  (is.na(end_station_id) | end_station_id == "") & 
  (is.na(end_station_name) | end_station_name == "") & 
  (round(end_lat, 2) == end_lat & round(end_lng, 2) == end_lng),
  c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
    match <- match_lat_lng(end_lat, end_lng, stations)
    list(
      station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
      station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
      station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
      station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
    )
  },
  by = .(end_lat, end_lng)
]

The percentage of stations resolved using tolerance of 0.01.

total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")
  • The Percentage of resolved start stations: 99.9829362.
  • The Percentage of resolved end stations: 99.8444463.

Comparing complete and incomplete data

unmatched_start <- monthly_data[
  (is.na(start_station_id) | start_station_id == "")
]

unmatched_end <- monthly_data[
  (is.na(end_station_id) | end_station_id == "")
]

unmatched_stations <- monthly_data[
  is.na(start_station_id) | start_station_id == "" | is.na(end_station_id) | end_station_id == ""
]
unmatched_rows <- nrow(unmatched_stations)


matched_stations <- monthly_data[
  !is.na(start_station_id) & start_station_id != "" & !is.na(end_station_id) & end_station_id != ""
]
matched_rows <- nrow(matched_stations)
  • There are 9939 rows with incomplete data. This is 0.1695972% of the total data.
  • There are 5850418 rows with complete data. This is 99.8304028%of the total data.

Removing rows with incomplete data

Since rows with incomplete data are less than 1%, removing them will not have that much effect on the results of the data analysis.

monthly_data <- matched_stations

Latitude and Longitude

To standardize the latitude and longitude of stations in the monthly_data table based on the stations table, we overwrite the latitude and longitude values in monthly_data with those from the stations table. This ensures that each station in monthly_data has a single, consistent set of coordinates.

library(data.table)

# Ensure data.tables
setDT(monthly_data)
setDT(stations)

# Ensure `monthly data` has unique rows by `ride_id`
monthly_data <- unique(monthly_data, by = "ride_id")

# Standardize start station latitude and longitude
monthly_data <- merge(
  monthly_data,
  stations[, .(station_id, station_lat, station_lng)],
  by.x = "start_station_id",
  by.y = "station_id",
  all.x = TRUE
)

# Replace start_lat and start_lng with consistent values from stations
monthly_data[, start_lat := station_lat]
monthly_data[, start_lng := station_lng]

# Drop the merged station columns
monthly_data[, c("station_lat", "station_lng") := NULL]

# Standardize end station latitude and longitude
monthly_data <- merge(
  monthly_data,
  stations[, .(station_id, station_lat, station_lng)],
  by.x = "end_station_id",
  by.y = "station_id",
  all.x = TRUE
)

# Replace end_lat and end_lng with consistent values from stations
monthly_data[, end_lat := station_lat]
monthly_data[, end_lng := station_lng]

# Drop the merged station columns
monthly_data[, c("station_lat", "station_lng") := NULL]

The result of cleaning monthly data table:

  1. The number of rows missing in start latitude column: 0.
  2. The number of rows missing in start longitude column: 0.
  3. The number of rows missing in start station name column: 0.
  4. The number of rows missing in start station id field: 0.
  5. The number of rows missing in end latitude column: 0.
  6. The number of rows missing in end longitude column: 0.
  7. The number of rows missing in station name column: 0.
  8. The number of rows missing in end station id column: 0.

Proper Date and Time Formatting

Converted started_at and ended_at columns to POSIXct for accurate time calculations.

# Convert date/time fields to proper format
monthly_data[, started_at := as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S")]
monthly_data[, ended_at := as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S")]

Remove Duplicate Records

Duplicate rows were removed to avoid skewed results.

# Ensure `monthly data` has unique rows by `ride_id`
monthly_data <- unique(monthly_data, by = "ride_id")

3.2 Data Manipulation

Trip Duration

Calculated as the difference between ended_at and started_at.

# Calculate trip duration in minutes
monthly_data[, trip_duration := as.numeric(difftime(ended_at, started_at, units = "mins"))]

Day of Week

Extracted the weekday from started_at.

# Extract the day of the week
monthly_data[, day_of_week := weekdays(started_at)]

Hour of the Day

Extracted the hour from started_at.

# Extract the hour from the start time
monthly_data[, hour_of_day := as.numeric(format(started_at, "%H"))]

4. Analyze the Data

Analyzed total rides, popular days, months, start stations, and end stations for both user types.

4.1 Aggregate Metrics

Total Number of Rides by User Type

Table of Rides
# Total number of rides by user type
total_rides <- monthly_data[, .N, by = member_casual]
setnames(total_rides, "N", "TotalRides") # Rename column for clarity

# Calculate percentage
total_rides[, Percentage := round((TotalRides / sum(TotalRides)) * 100, 1)]

# Styled table with kableExtra
total_rides %>%
  knitr::kable(
    col.names = c("User Type", "Total Rides", "Percentage (%)"),
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
User Type Total Rides Percentage (%)
casual 2144200 36.7
member 3706218 63.3
Bar Graph
plot_ly(
  data = total_rides,
  x = ~member_casual,
  y = ~Percentage,
  type = "bar",
  text = ~paste(Percentage, "%"),
  textposition = "outside",
  marker = list(color = c("#1f77b4", "#ff7f0e"))
) %>%
  layout(
    title = "Percentage of Rides by User Type",
    xaxis = list(title = "User Type"),
    yaxis = list(title = "Percentage (%)"),
    showlegend = FALSE
  )

Average Trip Duration by User Type

Members take more rides overall but have shorter trip durations on average.

Member Table
# Average trip duration by user type
avg_duration <- monthly_data[, .(avg_duration = mean(trip_duration, na.rm = TRUE)), by = member_casual]

# Average trip duration table
avg_duration %>%
  knitr::kable(
    col.names = c("User Type", "Average Duration (minutes)"),
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
User Type Average Duration (minutes)
casual 21.09785
member 12.22509
Bar Graph
# Interactive Plot
plot_ly(avg_duration, x = ~member_casual, y = ~avg_duration, type = 'bar', color = ~member_casual) %>%
  layout(title = "Average Trip Duration by User Type",
         xaxis = list(title = "User Type"),
         yaxis = list(title = "Average Duration (minutes)"))

4.2 Analyze Usage Patterns

Peak Hours

In the weekdays, member rides peak during morning hours (5 am - 8 am) and evening hours (3 pm - 6 pm). In the weekends, peak hours are between 9 am to 6 pm.

For casual riders, peak hours in the weekdays are between 3 pm to 6 pm. In the weekends, peak hours are between 8 am to 6 pm.

  • Casual Riders: Ride primarily on weekends and during leisure hours.
  • Members: Use bikes more consistently across the week, especially during commute hours.
Heatmap
# Aggregate data for peak hours and days
heatmap_data <- monthly_data[, .(Total_Rides = .N), by = .(hour_of_day, day_of_week, member_casual)]
heatmap_data[, day_of_week := factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))]

# Create heatmap for member riders
member_heatmap <- heatmap_data[member_casual == "member"]
member_plot <- plot_ly(
  data = member_heatmap,
  x = ~hour_of_day,
  y = ~day_of_week,
  z = ~Total_Rides,
  type = "heatmap",
  colorscale = list(c(0, "lightblue"), c(1, "darkblue")),
  hoverinfo = "x+y+z",
  showscale = TRUE
) %>%
  layout(
    xaxis = list(title = "Hour of Day"),
    yaxis = list(title = "Day of Week"),
    colorbar = list(title = "Total Rides")
  )

# Create heatmap for casual riders
casual_heatmap <- heatmap_data[member_casual == "casual"]
casual_plot <- plot_ly(
  data = casual_heatmap,
  x = ~hour_of_day,
  y = ~day_of_week,
  z = ~Total_Rides,
  type = "heatmap",
  colorscale = list(c(0, "lightpink"), c(1, "red")),
  hoverinfo = "x+y+z",
  showscale = TRUE
) %>%
  layout(
    xaxis = list(title = "Hour of Day"),
    yaxis = list(title = "Day of Week"),
    colorbar = list(title = "Total Rides")
  )

# Combine the plots with labels
subplot(member_plot, casual_plot, nrows = 2, shareX = TRUE, titleX = TRUE, titleY = TRUE) %>%
  layout(
    title = "Peak Hours by User Type",
    annotations = list(
      list(
        x = 0.5,
        y = 1,
        text = "Member Riders",
        showarrow = FALSE,
        xref = "paper",
        yref = "paper",
        font = list(size = 14, color = "yellow")
      ),
      list(
        x = 0.5,
        y = 0.5,
        text = "Casual Riders",
        showarrow = FALSE,
        xref = "paper",
        yref = "paper",
        font = list(size = 14, color = "red")
      )
    )
  )
Bar Graph
# Calculate average total rides for each hour across 12 months
peak_hours <- monthly_data[, .(Avg_Total_Rides = mean(.N, na.rm = TRUE)), 
                           by = .(hour_of_day, member_casual)]

# Rename column for clarity
setnames(peak_hours, "Avg_Total_Rides", "TotalRides")

# Sort data by Total Rides in descending order
peak_hours <- peak_hours[order(-TotalRides)]



# Aggregate data for peak hours
hourly_data <- monthly_data[, .N, by = .(hour_of_day, member_casual)]

# Create a bar graph with plot_ly
plot_ly(
  data = hourly_data,
  x = ~hour_of_day,
  y = ~N,
  color = ~member_casual,
  type = "bar",
  hoverinfo = "x+y+name",
  barmode = "group"
) %>%
  layout(
    title = "Peak Hours by User Type",
    xaxis = list(title = "Hour of Day", tickmode = "array", tickvals = 0:23),
    yaxis = list(title = "Number of Rides"),
    legend = list(title = list(text = "User Type")),
    margin = list(t = 50, b = 50, l = 50, r = 50)
  )

Casual riders to watch

During the week days, there is a good number of casual riders who use bikes in the mornings (5 am - 8 am) and in the evenings (3 pm - 6 pm). This group of casual riders have similar pattern to member riders in terms of riding hours. It is therefore worthwhile to have target advert for this group in order to convince them to join member subscription.

# Step 1: Add filters for casual riders, 5 AM to 8 AM, Monday to Friday
filtered_data_am <- monthly_data[
  member_casual == "casual" &
  hour(started_at) >= 5 & hour(started_at) < 8 & # Between 5 AM and 8 AM
  weekdays(started_at) %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") # Weekdays only
]

# Step 2: Add filters for casual riders, 3 PM to 6 PM, Monday to Friday
filtered_data_pm <- monthly_data[
  member_casual == "casual" &
  hour(started_at) >= 15 & hour(started_at) < 18 & # Between 5 PM and 6 PM
  weekdays(started_at) %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") # Weekdays only
]
casual_all <- monthly_data[
  member_casual == "casual"
]

# Step 3: Count rides by start station for 5 AM to 8 AM
casual_riders_start_stations_am <- filtered_data_am[, .N, by = .(start_station_name)]
setnames(casual_riders_start_stations_am, "N", "TotalRides")

# Step 4: Count rides by start station for 3 PM to 6 PM
casual_riders_start_stations_pm <- filtered_data_pm[, .N, by = .(start_station_name)]
setnames(casual_riders_start_stations_pm, "N", "TotalRides")

# Step 5: Order by Total Rides in descending order for both time periods
casual_riders_start_stations_am <- casual_riders_start_stations_am[order(-TotalRides)]
casual_riders_start_stations_pm <- casual_riders_start_stations_pm[order(-TotalRides)]

# Filter the top 10 rows for both time periods
top_10_casual_riders_start_stations_am <- casual_riders_start_stations_am[1:10]
top_10_casual_riders_start_stations_pm <- casual_riders_start_stations_pm[1:10]

# Step 6: Display the tables using kableExtra
# Table for 5 AM to 8 AM
top_10_casual_riders_start_stations_am_table <- top_10_casual_riders_start_stations_am %>%
  knitr::kable(
    col.names = c("Start Station", "Total Rides"), # Table column names
    caption = "Top 10 Start Stations for Casual Riders (5 AM to 8 AM, Mon-Fri)", # Table caption
    format = "html"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style Start Station column
  column_spec(2, color = "black", background = "lightgray") # Style Total Rides column

# Table for 3 PM to 6 PM
top_10_casual_riders_start_stations_pm_table <- top_10_casual_riders_start_stations_pm %>%
  knitr::kable(
    col.names = c("Start Station", "Total Rides"), # Table column names
    caption = "Top 10 Start Stations for Casual Riders (3 PM to 6 PM, Mon-Fri)", # Table caption
    format = "html"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  column_spec(1, bold = TRUE, color = "white", background = "blue") %>% # Style Start Station column
  column_spec(2, color = "black", background = "lightgray") # Style Total Rides column

This group represent 20.4554612% of casual riders.

The morning table
Top 10 Start Stations for Casual Riders (5 AM to 8 AM, Mon-Fri)
Start Station Total Rides
Dearborn Pkwy & Delaware Pl 1203
Kingsbury St & Kinzie St 873
Wells St & Elm St 865
Clark St & Elm St 726
Canal St & Madison St 663
Clark St & North Ave 649
New St & Illinois St 646
Stockton Dr & Wrightwood Ave 632
Clinton St & Washington Blvd 626
Kingsbury St & Erie St 599
The evening table
Top 10 Start Stations for Casual Riders (3 PM to 6 PM, Mon-Fri)
Start Station Total Rides
Streeter Dr & Grand Ave 10501
DuSable Lake Shore Dr & Monroe St 6377
Canal St & Adams St 4496
LaSalle St & Illinois St 4253
Kingsbury St & Kinzie St 3929
Dearborn St & Adams St 3904
Michigan Ave & Oak St 3820
Shedd Aquarium 3768
Millennium Park 3722
New St & Illinois St 3506

6. Recommendations

Target Casual Riders with Similar Behavior to Members

I. Focus on casual riders who commute on weekdays during morning hours (5 AM and 8 AM) and evening hours ( 3PM - 6 PM).
II. Offer incentives like discounted memberships emphasizing commuting benefits.

Seasonal Promotions

I. Launch campaigns during the summer months to convert recreational casual riders into annual members.
II. Highlight membership savings for frequent users.

Station-Based Marketing

I. Use geotargeted advertisements at popular casual rider stations (e.g., Streeter Dr & Grand Ave) promoting annual memberships and their benefits.